Abstract — Despite being invented close to sixty years ago, 
artificial neural networks (ANNs) remain an area of active 
research and a powerful tool. Their resurgence in the context of 
deep learning has led to dramatic improvements in various 
domains, including computer vision. The quantity of available 
data and the available computing power keep increasing, which 
favors the training of high-capacity models such as Convolutional 
Neural Networks (CNNs). It has been shown that CNNs provide a 
high-level descriptor of the visual content of an image. In this 
paper, we investigate the use of such descriptors (convolutional 
neural codes) in a content-based image retrieval (CBIR) 
application. 

Index Terms — Feature extraction, image retrieval, neural 
network, transfer learning, semantics. 

I. INTRODUCTION 

Originally, images were manually annotated with 
keywords and text-based retrieval systems were utilized. 
However, due to the rapidly increasing size of image 
collections, manual annotation became infeasible. Therefore, 
content-based retrieval systems relying on image content only 
were developed and are heavily researched within the 
computer vision community. A separate but related problem is 
image classification. It has been suggested that the features 
emerging in the upper layers of a CNN trained to classify images 
can serve as good descriptors for image retrieval. In particular, 
Krizhevsky et al. have shown some qualitative evidence for 
this. We measure such performance on images from the ImageNet 
dataset. 

The main problem in implementing and training deep CNNs 
is computational cost. Most implementations therefore 
use one or more GPUs, and training still takes several days. As 
a result, pre-trained models have become quite popular (that is, 
the weights of a trained network, together with a specification 
of the network architecture, are shared with the community). In 
the experiments with several standard retrieval benchmarks, 
we establish that convolutional neural codes perform 
competitively even when the convolutional neural network 
has been trained for an unrelated classification task. We also 
evaluate the improvement in the retrieval performance of 
convolutional neural codes when we apply a transfer 
learning technique. 
II. Convolutional neural network 


A. Neural networks 

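The prototypical model of neural networks is the L-layer 
perceptron. Given the output $y^{(l-1)}$ of layer $(l-1)$, with 
$m^{(l-1)}$ components, layer $l$ computes: 

$$y_i^{(l)} = f\left(z_i^{(l)}\right) \quad \text{with} \quad z_i^{(l)} = \sum_{k=1}^{m^{(l-1)}} w_{i,k}^{(l)} \, y_k^{(l-1)} + w_{i,0}^{(l)}, \qquad 1 \le i \le m^{(l)} \qquad (1)$$

where $f$ is an activation function which is applied 
component-wise. 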
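
As an illustration of Eq. (1), the following minimal sketch computes the output of one perceptron layer in plain Python. The function name `layer_forward`, the choice of the logistic sigmoid for $f$, and the toy weights are assumptions made for this example only; the bias list plays the role of the weights $w_{i,0}$.

```python
import math


def sigmoid(z):
    """Logistic sigmoid, one possible choice for the activation f."""
    return 1.0 / (1.0 + math.exp(-z))


def layer_forward(y_prev, weights, biases, f):
    """Return y_i = f(sum_k weights[i][k] * y_prev[k] + biases[i]) for every unit i."""
    outputs = []
    for w_i, b_i in zip(weights, biases):
        z_i = sum(w * y for w, y in zip(w_i, y_prev)) + b_i
        outputs.append(f(z_i))
    return outputs


# Toy layer with two inputs and two units (values chosen for illustration).
y_prev = [1.0, 2.0]
weights = [[0.5, -0.25], [1.0, 1.0]]
biases = [0.0, -3.0]
y = layer_forward(y_prev, weights, biases, sigmoid)
```

Both pre-activations are zero in this toy setting, so both units output 0.5.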


B. Layer types 

Similar to L-layer perceptrons, convolutional neural 
networks can be broken down into L layers. Different layer 
types are used to allow raw images as input, incorporate 
invariance to noise and distortions and accelerate training. 

- Convolutional layer 

The convolutional layer is the key ingredient of a 
convolutional neural network, as it allows handling 
multichannel images as raw input. If layer $l$ is a convolutional 
layer, its input is given by $m_1^{(l-1)}$ feature maps $Y_i^{(l-1)}$ from the 
previous layer, each of size $m_2^{(l-1)} \times m_3^{(l-1)}$. Then, layer $l$ 
computes $m_1^{(l)}$ feature maps as: 

$$Y_i^{(l)} = B_i^{(l)} + \sum_{j=1}^{m_1^{(l-1)}} W_{i,j}^{(l)} * Y_j^{(l-1)}, \qquad \forall\, 1 \le i \le m_1^{(l)} \qquad (2)$$

where $B_i^{(l)}$ is a matrix of biases and $W_{i,j}^{(l)}$ is a matrix of weights 
used as a discrete filter. The size $m_2^{(l)} \times m_3^{(l)}$ of the feature 
maps $Y_i^{(l)}$ is dependent on the filter size as well as border 
effects. For $l = 1$, referring to the input image channels 
as $Y_1^{(0)}, \ldots, Y_{m_1^{(0)}}^{(0)}$, the layer directly operates on the input image. 
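
The computation of Eq. (2) can be sketched in plain Python as follows. This is an illustrative sketch under stated assumptions, not the implementation used in the experiments: the helper names are hypothetical, borders are handled without padding ("valid" convolution), and the bias of each output map is simplified to a single scalar instead of a bias matrix.

```python
def convolve2d_valid(image, kernel):
    """Discrete 2-D convolution without padding (the kernel is flipped)."""
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    return [[sum(kernel[kh - 1 - u][kw - 1 - v] * image[r + u][s + v]
                 for u in range(kh) for v in range(kw))
             for s in range(out_w)]
            for r in range(out_h)]


def conv_layer(inputs, filters, biases):
    """inputs[j]: input map j; filters[i][j]: filter from input map j to
    output map i; biases[i]: scalar bias of output map i (simplification)."""
    outputs = []
    for filters_i, bias_i in zip(filters, biases):
        responses = [convolve2d_valid(y_j, w_ij)
                     for y_j, w_ij in zip(inputs, filters_i)]
        rows, cols = len(responses[0]), len(responses[0][0])
        outputs.append([[bias_i + sum(resp[r][s] for resp in responses)
                         for s in range(cols)]
                        for r in range(rows)])
    return outputs


# Two 3x3 input maps, one output map, 2x2 filters (toy values).
inputs = [[[1, 2, 3], [4, 5, 6], [7, 8, 9]],
          [[1, 1, 1], [1, 1, 1], [1, 1, 1]]]
filters = [[[[1, 0], [0, -1]], [[1, 1], [1, 1]]]]
feature_maps = conv_layer(inputs, filters, biases=[0.5])
```

With 3x3 inputs and 2x2 filters, the output map has size 2x2, which illustrates how the feature map size depends on the filter size and on the border handling.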


- Non-linearity and Rectification Layer 

A non-linearity layer applies an activation function 
$f$ component-wise on its input feature maps: 

$$Y_i^{(l)} = f\left(Y_i^{(l-1)}\right), \qquad \forall\, 1 \le i \le m_1^{(l)} = m_1^{(l-1)} \qquad (3)$$

Thus, the output of layer $l$ is given by $m_1^{(l)} = m_1^{(l-1)}$ feature 
maps of size $m_2^{(l)} \times m_3^{(l)} = m_2^{(l-1)} \times m_3^{(l-1)}$. Common activation 
functions for convolutional neural networks are the logistic 
sigmoid, the hyperbolic tangent and the rectified linear 
unit $f(z) = \max\{0, z\}$. With the latter, we refer to the layer as a 
rectification layer, which can also be interpreted as a separate layer. 
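
A non-linearity layer leaves the number and size of the feature maps unchanged, as the short sketch below shows. The function names are chosen for this example only.

```python
import math


def relu(z):
    """Rectified linear unit f(z) = max{0, z}."""
    return max(0.0, z)


def apply_activation(feature_maps, f):
    """Apply f to every entry of every feature map; shapes are preserved."""
    return [[[f(v) for v in row] for row in fmap] for fmap in feature_maps]


maps = [[[-1.0, 2.0], [0.5, -3.0]]]
rectified = apply_activation(maps, relu)
squashed = apply_activation(maps, math.tanh)
```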
- Local Contrast Normalization Layer 

A local contrast normalization layer aims to create 
competition among feature maps computed using different 
filters. Furthermore, contrast normalization layers can also be 
motivated using results from neuroscience. Krizhevsky et al. 
use brightness normalization: 

$$\left(Y_i^{(l)}\right)_{r,s} = \frac{\left(Y_i^{(l-1)}\right)_{r,s}}{\left(\kappa + \lambda \sum_{j=1}^{m_1^{(l-1)}} \left(Y_j^{(l-1)}\right)_{r,s}^{2}\right)^{\mu}}, \qquad \forall\, 1 \le i \le m_1^{(l)} \qquad (4)$$

where $\kappa$, $\lambda$ and $\mu$ are hyperparameters. 

- Pooling layer 
Fig. 1 CBIR system 


Average pooling computes the average value within 
(non-overlapping) windows. 

Max pooling computes the maximum value within 
(non-overlapping) windows. 

Pooling has been found to improve convergence and 
reduce overfitting. 
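
Both pooling variants can be sketched with a single helper that reduces each non-overlapping window to one value. The helper name `pool2d` and the toy feature map are assumptions for this example; rows and columns that do not fill a complete window are ignored.

```python
def mean(values):
    return sum(values) / len(values)


def pool2d(fmap, size, reduce_fn):
    """Reduce each non-overlapping size x size window of fmap with reduce_fn."""
    rows = len(fmap) // size
    cols = len(fmap[0]) // size
    return [[reduce_fn([fmap[r * size + u][s * size + v]
                        for u in range(size) for v in range(size)])
             for s in range(cols)]
            for r in range(rows)]


fmap = [[1, 3, 2, 0],
        [5, 7, 4, 6],
        [9, 8, 1, 1],
        [2, 1, 3, 3]]
max_pooled = pool2d(fmap, 2, max)    # maximum of each 2x2 window
avg_pooled = pool2d(fmap, 2, mean)   # average of each 2x2 window
```

For this 4x4 map, max pooling gives [[7, 6], [9, 3]] and average pooling gives [[4.0, 3.0], [5.0, 2.0]].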

- Fully connected layer 

If layer $l$ is a fully connected layer and layer $(l-1)$ is one of 
the above layers, the input feature maps $Y_i^{(l-1)}$ are interpreted 
as $m_2^{(l-1)} \cdot m_3^{(l-1)}$-dimensional vectors, and layer $l$ 
computes a weighted sum of their components followed by the 
activation function $f$, as in (1). 

III. Content based image retrieval 

Content-based image retrieval (CBIR), also known as 
query by image content (QBIC), is the application of computer 
vision techniques to the image retrieval problem, that is, the 
problem of searching for digital images in large databases [2]. 
It aims at finding images of interest in a large image database 
using the visual content of the images. "Content-based" 
means that the search analyzes the actual contents of the 
image rather than metadata such as keywords, tags, and/or 
descriptions associated with the image. The term 'content' in 
this context might refer to colors, shapes, textures, or any 
other information that can be derived from the image itself [3]. 

In on-line image retrieval, the user submits a query 
example to the retrieval system to search for desired images. 
The system represents this example with a feature vector, and 
the distances (i.e., dissimilarities) between the feature vector 
of the query example and those of the images in the feature 
database are then computed. An indexing scheme provides an 
efficient way of searching the image database. Finally, the 
system ranks the search results and returns the images that are 
most similar to the query example [4]. A typical architecture 
of a CBIR system is illustrated in Figure 1. 

IV. Approach 

A. Using pretrained convolutional neural codes 

The model includes five convolutional layers, each 
including a convolution and a rectified linear (ReLU) transform, 
followed in some layers by a max pooling transform (layers 1, 2, 
and 3). At the top of the architecture are three fully connected 
layers (layer 6, layer 7, layer 8), which take as input the output 
of the previous layer, multiply it by a matrix and, in the case 
of layers 6 and 7, apply a rectified linear transform. The 
network is trained so that the layer 8 output corresponds to the 
one-hot encoding of the class label. The softmax loss is used 
during training. 
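
The query step of the CBIR pipeline (represent the query by a feature vector, compute its distance to every vector of the feature database, rank) can be sketched as follows. In our setting, the feature vectors would be the convolutional neural codes. The Euclidean distance, the function names, and the toy two-dimensional vectors are assumptions made for this example; no indexing scheme is used, so the search is exhaustive.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two feature vectors of equal length."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def retrieve(query_vector, feature_database, top_k):
    """Return the ids of the top_k database images closest to the query.

    feature_database maps an image id to its feature vector.
    """
    ranked = sorted(feature_database.items(),
                    key=lambda item: euclidean(query_vector, item[1]))
    return [image_id for image_id, _ in ranked[:top_k]]


# Hypothetical image ids and feature vectors, for illustration only.
feature_database = {
    "img_a": [0.9, 0.1],
    "img_b": [0.0, 1.0],
    "img_c": [1.0, 0.2],
    "img_d": [-1.0, 0.0],
}
results = retrieve([1.0, 0.0], feature_database, top_k=2)
```

Here `results` contains `img_a` followed by `img_c`, the two vectors closest to the query.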


Fig. 2 Architecture used by Krizhevsky et al. 


B. Transfer learning 

A common recipe for a computer vision problem is to 
first train an image classification model on the ImageNet 
Challenge data set, and then transfer this model's knowledge 
to a distinct task. Modifying an existing rich deep learning 
model in this way allows a new model to be created with 
significantly less training data and time. This concept is 
called transfer learning. 

The common practice is to truncate the last layer (softmax 
layer) of the pre-trained network and replace it with a new 
softmax layer that is relevant to our own problem. 
Essentially, instead of starting the learning process from a 
(often randomly initialised) blank sheet, we start from 
patterns that have been learned to solve a different task. 

Two common approaches may be used: the develop-model 
approach and the pre-trained-model approach. We chose the 
second approach, which consists of the following steps: 

- Select Source Model. A pre-trained source model is 
chosen from available models. Many research 
institutions release models trained on large and challenging 
datasets, which may be included in the pool of candidate 
models from which to choose. We used the 
Inception-v3 model (Fig. 3). 

- Reuse Model. The pre-trained model can then be 
used as the starting point for a model on the second task 
of interest. 

- Tune Model. Optionally, the model may need to be 
adapted or refined on the input-output pair data 
available for the task of interest. 
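
The three steps above can be illustrated with a deliberately small sketch in plain Python. It is not the Inception-v3 setup used in our experiments: `frozen_features` is a hypothetical stand-in for a pre-trained network whose last layer has been removed, and only the new softmax layer is trained, here by stochastic gradient descent on toy two-dimensional data. All names, the learning rate, and the number of epochs are assumptions for this example.

```python
import math


def frozen_features(x):
    """Stand-in for the reused pre-trained layers; never updated here."""
    return [x[0] + x[1], x[0] - x[1], 1.0]  # last entry acts as a bias input


def softmax(scores):
    top = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def class_scores(weights, feats):
    return [sum(w * f for w, f in zip(w_c, feats)) for w_c in weights]


def train_new_head(samples, labels, num_classes, epochs=100, lr=0.1):
    """Train only the new softmax layer on top of the frozen features."""
    dim = len(frozen_features(samples[0]))
    weights = [[0.0] * dim for _ in range(num_classes)]
    for _ in range(epochs):
        for x, label in zip(samples, labels):
            feats = frozen_features(x)
            probs = softmax(class_scores(weights, feats))
            for c in range(num_classes):
                # Gradient of the softmax loss with respect to the class score.
                error = probs[c] - (1.0 if c == label else 0.0)
                for k in range(dim):
                    weights[c][k] -= lr * error * feats[k]
    return weights


def predict(weights, x):
    scores = class_scores(weights, frozen_features(x))
    return scores.index(max(scores))


# Toy data for the new task (two classes).
samples = [(-1, -1), (-2, -1), (-1, -2), (1, 1), (2, 1), (1, 2)]
labels = [0, 0, 0, 1, 1, 1]
head = train_new_head(samples, labels, num_classes=2)
predictions = [predict(head, x) for x in samples]
```

The optional tuning step would additionally update the reused layers, which this sketch keeps fixed.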
Fig. 3 Schematic diagram of Inception-v3 

V. System evaluation 

The system evaluation aims at measuring the performance 
of the content-based image retrieval system. Testing was 
performed on a set of 400 images from ten classes (Chimpanzee, 
Gorilla, Jaguar, Panthera Tigris, Tiger, Puma, Maki, Indri, 
Capuchin monkey and Macaque) taken from the ImageNet 
dataset. We used images of animals that look alike in their 
physical characteristics to highlight the performance of the 
CNN. 

The performance of the system was evaluated by using 
precision over the first 12 retrieved images and recall. The 
Precision-Recall curve is a common instrument to visualize 
and understand the performance of retrieval systems. 
Furthermore, this curve can be summarized in a single value: 
Average Precision which can be interpreted as the area under 
the curve. 


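$$AP = \sum_{i} \left(\mathrm{Rec}(i) - \mathrm{Rec}(i-1)\right) \, \frac{\mathrm{Pre}(i) + \mathrm{Pre}(i-1)}{2} \qquad (6)$$

where $\mathrm{Pre}(i)$ and $\mathrm{Rec}(i)$ denote the precision and the recall 
over the first $i$ retrieved images. 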
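
The measures used here can be computed from a ranked result list as sketched below. The function names are chosen for this example, and the starting point of the curve (precision 1 at recall 0) is an assumption, since Eq. (6) does not fix it.

```python
def precision_at_k(ranked_relevance, k):
    """Fraction of relevant images among the first k retrieved images."""
    return sum(1 for rel in ranked_relevance[:k] if rel) / k


def precision_recall_points(ranked_relevance, num_relevant):
    """Return (precision, recall) after each retrieved image.

    ranked_relevance[i] is True if the (i + 1)-th retrieved image is relevant.
    """
    points = []
    hits = 0
    for i, rel in enumerate(ranked_relevance, start=1):
        if rel:
            hits += 1
        points.append((hits / i, hits / num_relevant))
    return points


def average_precision(points, start=(1.0, 0.0)):
    """Area under the precision-recall curve by the trapezoidal rule."""
    ap = 0.0
    prev_pre, prev_rec = start  # assumed starting point of the curve
    for pre, rec in points:
        ap += (rec - prev_rec) * (pre + prev_pre) / 2.0
        prev_pre, prev_rec = pre, rec
    return ap


# Toy ranking: 4 retrieved images, 3 relevant images in the database.
ranking = [True, False, True, True]
points = precision_recall_points(ranking, num_relevant=3)
ap = average_precision(points)
```

For this toy ranking the result is 55/72, roughly 0.764. In our evaluation, `precision_at_k` would be called with `k=12`.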

The evaluation results show that average precision increases 
for every class when we use transfer learning (Table 1). The 
processing time for each retrieval session was about two 
seconds. This implies that the algorithms under the accuracy 
search node in our system perform better and find more relevant 
images. Fig. 4 shows an example of the results obtained when 
querying the system with a gorilla image. 


Classes            Pretrained    Transfer learning 
Chimpanzee         0.676         0.830 
Gorilla            0.749         0.843 
Jaguar             0.690         0.890 
Panthera tigris    0.674         0.974 
Tiger              0.736         0.916 
Puma               0.545         0.916 
Maki               0.730         0.833 
Indri              0.682         0.882 
Capuchin monkey    0.754         0.833 
Macaque            0.676         0.750 

Table 1 Average precision 



Fig. 4 Example of results 


VI. Conclusion and future work 

In this paper, we have investigated content-based 
image retrieval using convolutional neural codes. The 
system was built using two approaches: first, using 
a pre-trained model and second, using a transfer learning 
technique. The tests performed on the proposed content-based 
image retrieval system show the system's flexibility and 
adaptability to the user's needs and to the tackled scenario. 
Future work will extend this system. First, we 
plan to enrich it with other visual features to 
extend the collection of usable descriptors. Second, in the 
search phase, we plan to combine text-based retrieval with 
the proposed approach to improve 
the semantic interpretation of the images. 